Add LLM inference support to JMLC API #2430

Draft

kubraaksux wants to merge 8 commits into apache:main from kubraaksux:llm-api


@kubraaksux

Summary

Adds LLM text generation to the JMLC API, using Py4J to bridge Java and a Python worker that loads HuggingFace models.

Changes

  • Connection.java: loadModel() to start Python worker
  • PreparedScript.java: setLLMWorker(), generate()
  • LLMCallback.java: Java interface for Python callback
  • llm_worker.py: Python worker loading HuggingFace models
  • JMLCLLMInferenceTest.java: Integration test
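
To make the intended wiring concrete, here is a minimal sketch of the Py4J callback pattern these pieces follow: Java starts a GatewayServer, the Python worker connects back and registers an object implementing the LLMCallback interface, and Java-side generate() calls are forwarded over that bridge. Apart from the names listed above (LLMCallback, generate()), the class names, signatures, and port details below are illustrative assumptions, not the PR's actual code.

```java
// Sketch only: the Py4J callback pattern assumed by this PR. Everything except the
// LLMCallback/generate() names is an illustrative assumption.
import py4j.GatewayServer;

public class LlmBridgeSketch {

    /** Java-side interface; in the PR this is the top-level LLMCallback.java. */
    public interface LLMCallback {
        String generate(String prompt);
    }

    private volatile LLMCallback worker;

    /** Called from Python (via Py4J) once llm_worker.py has loaded the HuggingFace model. */
    public void registerWorker(LLMCallback worker) {
        this.worker = worker;
    }

    /** Forwards the prompt to the Python worker over the Py4J bridge. */
    public String generate(String prompt) {
        if (worker == null)
            throw new IllegalStateException("Python LLM worker not registered yet");
        return worker.generate(prompt);
    }

    public static void main(String[] args) {
        LlmBridgeSketch entryPoint = new LlmBridgeSketch();
        // Connection.loadModel() would launch the Python process and start a server like this,
        // using an auto-discovered port rather than the Py4J default shown here.
        GatewayServer server = new GatewayServer(entryPoint);
        server.start();
    }
}
```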

Test

mvn test -Dtest=JMLCLLMInferenceTest -pl .

WIP - looking for feedback on the approach.

- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath)
- Connection.java: Removed findPythonScript() method
- LLMCallback.java: Added Javadoc for generate() method
- JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()
- Connection.java: Auto-find available ports for Py4J communication
- Connection.java: Add loadModel() overload for manual port override
- Connection.java: Use destroyForcibly() with waitFor() for clean shutdown
- llm_worker.py: Accept python_port as command line argument
- Move worker script from src/main/python/systemds/ to src/main/python/ to avoid shadowing the Python stdlib operator module
- Add generateWithTokenCount() returning JSON with input/output token counts
- Update generateBatchWithMetrics() to include input_tokens and output_tokens columns
- Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py
- Check Python process liveness during startup instead of a blind 60s timeout (see the lifecycle sketch below)
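
The lifecycle items above (auto-found ports, a liveness check during startup, destroyForcibly() with waitFor() on shutdown) map to standard JDK calls. A minimal sketch follows, with hypothetical helper names; the actual implementation in Connection.java may differ, and the readiness probe is only indicated as a comment.

```java
// Sketch of the worker lifecycle techniques listed above: free-port discovery,
// liveness-aware startup wait, and forced shutdown. Helper names are assumptions.
import java.io.IOException;
import java.net.ServerSocket;

public class WorkerLifecycleSketch {

    /** Auto-find an available port by binding to port 0 and reading back the assigned port. */
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    /** Wait for the worker while checking that the Python process is still alive,
     *  instead of sleeping blindly for 60 seconds. */
    static void awaitWorkerReady(Process pythonWorker, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (!pythonWorker.isAlive())
                throw new IllegalStateException("Python worker exited during startup, exit code "
                    + pythonWorker.exitValue());
            // ... attempt the Py4J handshake here and return on success ...
            Thread.sleep(200);
        }
        throw new IllegalStateException("Timed out waiting for the Python worker to become ready");
    }

    /** Clean shutdown: force-kill the worker and block until it has actually terminated. */
    static void shutdownWorker(Process pythonWorker) throws InterruptedException {
        pythonWorker.destroyForcibly();
        pythonWorker.waitFor();
    }
}
```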